PMLB v1.0: an open source dataset collection for benchmarking machine learning methods

PMLB (Penn Machine Learning Benchmark) is an open-source data repository containing a curated collection of datasets for evaluating and comparing machine learning (ML) algorithms. Compiled from a broad range of existing ML benchmark collections, PMLB synthesizes and standardizes hundreds of publicly available datasets from diverse sources such as the UCI ML repository and OpenML [1], enabling systematic assessment of different ML methods. These datasets cover a range of applications, from binary/multi-class classification to regression problems with combinations of categorical and continuous features. PMLB has both a Python interface (pmlb) and an R interface (pmlbr), both with detailed documentation that allows the user to access cleaned and formatted datasets using a single function call (fetch_data). PMLB also provides a comprehensive description of each dataset and advanced functions to explore the dataset space, such as nearest_datasets and filter_datasets, which allow for a smoother user experience and handling of data. The resource is designed to facilitate open-source contributions in the form of datasets as well as improvements to curation.

Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, PA 19104 • Funded by National Institutes of Health Grant Nos. LM010098 and AI116794.


Statement of need
Benchmarking is a standard practice used to illustrate the strengths and weaknesses of algorithms with regard to different problem characteristics. In ML, benchmarking often involves assessing the performance of specific ML models, namely, how well they predict labels for new samples (supervised learning) or how well they organize and/or represent data with no pre-existing labels (unsupervised learning). The extent to which ML methods achieve these aims is typically evaluated over a group of benchmark datasets [2,3]. PMLB was designed to provide a suite of such datasets with uniform formatting, as well as the framework for conducting automatic evaluation of the different algorithms.
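As a brief illustration of that uniform access pattern, the snippet below uses the Python interface's fetch_data function (mushroom is one of the classification datasets in the collection):

```python
from pmlb import fetch_data

# Download (and locally cache) a PMLB dataset as a pandas DataFrame;
# the label column is uniformly named 'target'.
mushroom = fetch_data('mushroom')
print(mushroom.shape)

# Or request scikit-learn-style (features, labels) arrays directly.
X, y = fetch_data('mushroom', return_X_y=True)
```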
The original release of PMLB (v0.2) [4] received positive feedback from the ML community, reflecting the pressing need for a collection of standardized datasets with which to evaluate models without intensive preprocessing and dataset curation. As the repository has become more widely used, community members have requested new features, such as additional information about the datasets and new functions to select datasets given specific criteria. In this paper, we review the original functionality and present new enhancements that facilitate a fluid interaction with the repository, both from the perspective of database contributors and that of end users.

New datasets with rich metadata
Since its previous major release, v0.2 [4], we have made substantial improvements in the collection of new datasets as well as other helpful supporting features. PMLB now has a new repository structure that includes benchmark datasets for regression problems (Fig. 1). To fulfill requests made by several users, each dataset also includes a metadata.yaml file that contains general descriptive information about the dataset itself (an example can be viewed here). Specifically, for each dataset, the metadata file includes a web address to the original source of the dataset, a text description of the dataset's purpose, the publication associated with the dataset's generation, the type of learning problem it was designed for (i.e., classification or regression), keywords (e.g., "simulation", "ecological", "bioinformatics"), and a description of individual features and their coding schema (e.g., 'nonpromoter' = 0, 'promoter' = 1). Metadata files follow a standardized format that is formalized using JSON-Schema (version draft-07) [5]; upcoming releases of PMLB will include automated validation of datasets and metadata files to further improve ease of contribution and data accuracy. A number of open-source contributors have been invaluable in providing manually curated metadata. In addition, contributors' careful examination has led to important bug fixes, such as a correction to the target column in the bupa dataset.
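To illustrate what such schema-based validation could look like, here is a minimal Python sketch that loads a dataset's metadata.yaml and checks it against a JSON-Schema document. The file paths and schema filename are assumptions for illustration, not the repository's actual layout:

```python
import json

import yaml  # PyYAML
from jsonschema import validate, ValidationError

# Hypothetical paths: adjust to wherever the metadata and schema live locally.
METADATA_PATH = "datasets/bupa/metadata.yaml"
SCHEMA_PATH = "metadata_schema.json"

with open(METADATA_PATH) as f:
    metadata = yaml.safe_load(f)

with open(SCHEMA_PATH) as f:
    schema = json.load(f)

try:
    # Raises ValidationError if the metadata violates the schema.
    validate(instance=metadata, schema=schema)
    print("metadata.yaml conforms to the schema")
except ValidationError as err:
    print(f"Validation failed: {err.message}")
```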

User-friendly interfaces
On PMLB's home page, users can now browse, sort, filter, and search datasets from a lookup table of datasets with summary statistics (Fig. 2). To select datasets with numerical values for specific metadata characteristics (e.g., number of observations, number of features, class balance, etc.), one can type ranges in the box at the bottom of each numeric column in the format low ... high. For example, if the user wants to view all classification datasets with 80 to 100 observations, they would select classification at the bottom of the Task column and type 80 ... 100 at the bottom of the n_observations column. The CSV button allows the user to download the table's contents with any active filters applied. On the website, we have also published a concise contribution guide with step-by-step instructions on how to add new datasets, submit edits for existing datasets, or improve the provided Python or R code. When a new dataset is added, summary statistics (e.g., number of observations, number of classes, etc.) are automatically computed, a profiling report is generated (see below), a corresponding metadata template is added to the dataset folder, and PMLB's list of available dataset names is updated. Other checks included in the continuous integration workflow help reduce the amount of work required from both contributors and code reviewers.
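The same kind of filtering can also be done programmatically. The sketch below assumes the summary statistics table is published as a TSV file in the PMLB GitHub repository; the exact filename and column names are assumptions for illustration and may differ from the released layout:

```python
import pandas as pd

# Assumed location of the dataset summary statistics table; the actual
# filename and column names in the PMLB repository may differ.
STATS_URL = (
    "https://raw.githubusercontent.com/EpistasisLab/pmlb/master/"
    "pmlb/all_summary_stats.tsv"
)

stats = pd.read_csv(STATS_URL, sep="\t")

# Mirror the website example: classification datasets with 80-100 observations.
subset = stats[
    (stats["task"] == "classification")
    & stats["n_observations"].between(80, 100)
]
print(subset["dataset"].tolist())
```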
In addition to the Python interface for PMLB, we have included an R library that originated from a separate repository that is currently unmaintained. However, because its source code was released under the GNU General Public License, version 2, we were able to adapt the code to make it compatible with the new repository structure in this release and offer additional functionality. The R library also includes a number of detailed "vignette" documents to help new users learn how to use the software.
PMLB now includes original data rows with missing data (i.e., NA values). The new version of PMLB also allows the user to select the datasets most similar to one of their own using the nearest_datasets function. Here, the similarity between datasets is configurable to any number of metadata characteristics (e.g., number of samples, number of features, number of target classes, etc.). This functionality is helpful for users who wish to find PMLB datasets with characteristics similar to their own in order to test or optimize methods (e.g., hyperparameter tuning) for their desired problem without the risk of overfitting to their dataset. API reference guides that detail all user-facing functions and variables in PMLB's Python and R libraries are included on the PMLB website.
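A minimal sketch of this workflow in Python, assuming nearest_datasets accepts a feature matrix and label vector; the exact signature and available keyword arguments may differ from the released API:

```python
from pmlb import fetch_data, nearest_datasets

# Stand-in for a user's own data: any (features, labels) pair works here.
X, y = fetch_data('mushroom', return_X_y=True)

# Assumed usage: return the names of the PMLB datasets whose metadata
# characteristics are closest to (X, y).
similar = nearest_datasets(X, y)
print(similar)
```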

Pandas profiling reports
For each dataset, we use pandas-profiling to generate summary statistic reports. In addition to the descriptive statistics provided by the commonly used pandas.describe (Python) [6] or skimr::skim (R) functions, pandas-profiling gives a more extensive exploration of the dataset, including correlation structure within the dataset and flagging of duplicate samples. Browsing a report allows users and contributors to easily assess dataset quality and make any necessary changes. For example, if a feature is flagged by pandas-profiling as having a single value replicated in all samples, it is likely that this feature is uninformative for ML analysis and should be removed from the dataset.
The profiling reports can be accessed by clicking on the dataset name in the interactive data table or on the corresponding data point in the interactive chart on the PMLB website. Alternatively, all reports can be viewed on the repository's gh-pages branch, or generated manually by users on their local computing resources.
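For local generation, a minimal sketch using the pandas-profiling package might look like the following (the report title and output filename here are arbitrary choices for illustration):

```python
from pandas_profiling import ProfileReport
from pmlb import fetch_data

# Fetch a PMLB dataset and build an HTML profiling report for it.
df = fetch_data('mushroom')
profile = ProfileReport(df, title="mushroom profiling report")
profile.to_file("mushroom_report.html")
```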

Space efficiency
We have significantly reduced the size of the PMLB source repository by using Git Large File Storage (LFS) to efficiently track changes in large database source files [7]. Users who would like to interact with the entire repository (including the complete database sources) locally can do so either by installing Git LFS and cloning the PMLB repository, or by downloading a ZIP archive of the repository from GitHub in a web browser.
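For example, assuming Git LFS is already installed on the system, a full local copy can be obtained with standard git commands (the URL below is the project's GitHub repository):

```sh
# One-time setup: enable Git LFS for the current user.
git lfs install

# Clone the repository; LFS-tracked dataset files are fetched automatically.
git clone https://github.com/EpistasisLab/pmlb.git
```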

Figure 1: Characteristics of datasets in the PMLB collection

Figure 2: Dataset summary statistics table, with advanced searching, filtering, and sorting features